Naturally occurring networks exhibit quantitative features revealing underlying growth
mechanisms. Numerous network mechanisms have recently been proposed to reproduce specific
properties such as degree distributions or clustering coefficients. We present a method for
inferring the mechanism most accurately capturing a given network topology, exploiting
discriminative tools from machine learning. The Drosophila melanogaster protein network is
confidently and robustly (to noise and training data subsampling) classified as a
duplication-mutation-complementation network over preferential attachment, small-world, and
other duplication-mutation mechanisms. Systematic classification, rather than statistical
study of specific properties, provides a discriminative approach to understand the design of
complex networks.

1 Introduction 

Recent advances in our understanding of biological networks have often focused on understanding
the emergence of specific features such as scale-free degree distributions (1, 2, 3), short mean
geodesic lengths, or clustering coefficients (4). The insights gained into these topological patterns
have motivated various network growth and evolution models in order to determine what simple
mechanisms can reproduce the features observed. Among these are the preferential attachment
model (1, 5), exhibiting scale-free degree distributions, and the small-world model, exhibiting
high clustering coefficients and short mean geodesics. Moreover, various duplication-mutation
mechanisms have been proposed to describe biological networks (6, 7, 8, 9, 10, 11) and the World
Wide Web (12). However, in most cases model parameters can be tuned such that multiple models
of widely varying mechanisms perfectly fit the motivating real network in terms of single
selected features such as the scale-free exponent and the clustering coefficient. Since networks
with several thousands of vertices and edges are highly complex, it is also clear that these
features can only capture limited structural information.

Here, we make use of discriminative classification techniques recently developed in machine
learning (13, 14) to classify a given real network as one of many proposed network mechanisms
by enumerating local substructures. Determining what simple mechanism is responsible
for a natural network's architecture would (i) facilitate the development of correct priors for
constraining network inference and reverse engineering (15, 16, 17, 18); (ii) specify the appropriate
null model relative to which one evaluates statistical significance (19, 20, 21, 22, 23, 24, 25, 26, 27);
(iii) guide the development of improved network models; and (iv) reveal underlying design
principles of evolved biological networks. It is therefore desirable to develop a method to determine
which proposed mechanism models a given complex network without prior feature selection.

Enumeration of subgraphs has been successfully used to find network motifs (19, 20, 21, 22,
23, 24, 25, 26, 27) during the past few years and is historically a well-established method in
the sociology community (28). Recently, the idea of clustering real networks based on their
"significance profiles" has been proposed (29). The method assumes randomized networks
with fixed degree distribution as the null model to estimate the statistical significance of given
subgraphs. The significance profiles are then shown to be similar for various groups of naturally
occurring networks.

Finding statistically significant motifs and clustering can both be characterized as schemes 
to identify a reduced-complexity description of the networks. We here present an approach 
which is instead predictive, in which labeled graphs of known growth mechanisms are used as 
training data for a discriminative classifier. This classifier, then, presented with a new graph of 
interest, can reliably and robustly predict the growth mechanism which gave rise to that graph. 
Within the machine learning community, such predictive, supervised learning techniques are 
differentiated from descriptive, unsupervised learning techniques such as clustering. 

We apply our method to the recently published Drosophila melanogaster protein-protein
interaction network (30) and find that a duplication-mutation-complementation mechanism
best reproduces Drosophila's network. The classification is robust against noise, even after
random rewiring of 45% of the network edges. To validate, we also show that beyond 80%
random rewiring the correct (Erdős-Rényi) classification is obtained.

2 Methods 
2.1 The data set 

We use a protein-protein interaction map based on yeast two-hybrid screening (30). Since the
data set is subject to numerous false positives, Giot et al. assign a confidence score p ∈ [0, 1],
measuring how likely the interaction is to occur in vivo. In order to exclude unlikely interactions
and focus on a core network which retains significant global features, we determine a confidence
threshold p* based on percolation: measurements of the size of the components for all possible
values of p* show that the two largest components are connected for p* = 0.65 (see supplemental
material). Edges in the graph correspond to interactions for which p > p*. To reveal
possible structural changes in Drosophila for less stringent thresholds, we also present results
for p* = 0.5 as suggested in (30). We remove self-interactions from the network since none
of the proposed mechanisms allows for them. After eliminating isolated vertices, the resulting
networks consist of 3359 (4625) vertices and 2795 (4683) edges for p* = 0.65 (0.5).
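The preprocessing described above can be illustrated with a minimal sketch. It assumes the interaction map is available as (protein_a, protein_b, confidence) triples; the function names `core_network` and `component_sizes` and the toy data are hypothetical and are not taken from the original analysis.

```python
def core_network(interactions, p_star):
    """Threshold (protein_a, protein_b, confidence) triples at p_star.

    Keeps interactions with confidence > p_star, drops self-interactions,
    and treats edges as undirected. Returns (edges, vertices).
    """
    edges = set()
    for a, b, p in interactions:
        if p > p_star and a != b:
            edges.add(frozenset((a, b)))
    # Isolated vertices drop out: only endpoints of kept edges are collected.
    vertices = set().union(*edges)
    return edges, vertices


def component_sizes(edges):
    """Sizes of the connected components, largest first (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)
    sizes = {}
    for x in list(parent):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Toy data (made up for illustration only).
toy = [("A", "B", 0.90), ("B", "A", 0.80), ("B", "C", 0.70),
       ("C", "C", 0.95), ("D", "E", 0.66), ("E", "F", 0.55)]
for p_star in (0.65, 0.5):
    edges, vertices = core_network(toy, p_star)
    print(p_star, len(vertices), len(edges), component_sizes(edges))
```

Scanning `component_sizes` over a range of thresholds is one way to locate the value at which the two largest components merge, which is the percolation criterion used to choose p*.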

2.2 Network mechanisms 

We create 7000 graphs as training data, 1000 for each of seven different models drawn from the 
literature. Every graph is generated with the same number of edges and number of vertices as 
measured in Drosophila; all other existing parameters are sampled uniformly (31). The models
manifest various simple network mechanisms, many of which explicitly intend to model protein 
interaction networks. 

The duplication-mutation-complementation (DMC) algorithm is inspired by an evolutionary
model of the genome (32, 33) proposing that most of the duplicate genes observed today
have been preserved by functional complementation. If either the gene or its copy loses one
of its functions (edges), the other becomes essential in assuring the organism's survival. There
is thus an increased preservation of duplicate genes induced by null mutations. The algorithm
features a duplication step followed by mutations that preserve functional complementarity. At
every time step one chooses a vertex v at random. A twin vertex v_twin is then introduced, copying
all of v's edges. For each edge of v, one deletes with probability q_del either the original
edge or its corresponding edge of v_twin. The twins themselves are conjoined with an independent
probability q_con, representing an interaction of a protein with its own copy. Note that no
new edges are created by mutations. The DMC mechanism thus assumes that the probability of
creating new advantageous functions by random mutations is negligible.
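The DMC growth step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the training data: the two-vertex seed graph, the equal chance of deleting the original edge or the twin's copy, and the function name `dmc_graph` are all assumptions. The sketch also does not enforce a target number of edges, whereas the training graphs are generated with fixed vertex and edge counts.

```python
import random


def dmc_graph(n_vertices, q_del, q_con, rng=None):
    """Grow a graph by duplication-mutation-complementation (DMC).

    Returns an adjacency dict {vertex: set(neighbours)}.
    Assumption: growth starts from two vertices joined by one edge.
    """
    rng = rng or random.Random()
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_vertices:
        v = rng.choice(list(adj))
        twin = len(adj)
        # Duplication: the twin copies all of v's edges.
        adj[twin] = set(adj[v])
        for u in adj[twin]:
            adj[u].add(twin)
        # Complementation: for each edge of v, with probability q_del delete
        # either the original edge or the twin's copy, never both.
        # Assumption: the two alternatives are equally likely.
        for u in list(adj[v]):
            if rng.random() < q_del:
                loser = v if rng.random() < 0.5 else twin
                adj[loser].discard(u)
                adj[u].discard(loser)
        # Conjoin the twins with independent probability q_con.
        if rng.random() < q_con:
            adj[v].add(twin)
            adj[twin].add(v)
    return adj


graph = dmc_graph(200, q_del=0.4, q_con=0.3, rng=random.Random(1))
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), "vertices,", n_edges, "edges")
```

Because every mutation only removes one of two redundant edges, each neighbour of v stays connected to at least one of the twins, which is the functional complementarity the mechanism is named for.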

A slightly different implementation of duplication-mutation uses random mutations (DMR).
Possible interactions between twins are neglected. Instead, edges between
v_twin and the neighbors of v can be removed with a probability q_del, and new edges can be created
at random between v and any other vertices with a probability q_new/N, N being the current
total number of vertices. DMR thus emphasizes the creation of new advantageous functions by
mutation.

Additionally, we create training data for linear preferential attachment (LPA) networks (1, 5)
(growing graphs with a probability of attaching to previous vertices proportional to k + a, a
being a constant parameter and k the degree of the chosen vertex), random static networks
(RDS) (34) (also known as Erdős-Rényi graphs; vertices are connected randomly), random
growing networks (RDG) (35) (growing graphs where new edges are created randomly between
existing vertices), aging vertex (AGV) networks (36) (growing graphs modeling citation
networks, where the probability for new edges decreases with the age of the vertex), and
small-world (SMW) networks (4) (interpolation between regular ring lattices and randomly connected
graphs). For descriptions of the specific algorithms we refer the reader to the supplemental
material.
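Two of the simpler models can be sketched in a few lines. The exact algorithms are given in the supplemental material, so the following is only an illustration under assumptions: `rds_graph` draws a fixed number of distinct edges uniformly at random, and `lpa_graph` adds exactly one edge per new vertex starting from a single seed edge. Both function names and both simplifications are hypothetical.

```python
import random


def rds_graph(n, m, rng):
    """Random static graph: m distinct edges drawn uniformly among n vertices.

    Requires m <= n * (n - 1) / 2, otherwise the loop cannot terminate.
    """
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return edges


def lpa_graph(n, a, rng):
    """Linear preferential attachment with one edge per new vertex.

    Each new vertex attaches to an existing vertex with probability
    proportional to k + a, where k is that vertex's current degree.
    Requires a > -1 so that all weights stay positive.
    """
    degree = [1, 1]          # seed: vertices 0 and 1 joined by one edge
    edges = {(0, 1)}
    for new in range(2, n):
        weights = [k + a for k in degree]
        target = rng.choices(range(new), weights=weights)[0]
        edges.add((target, new))
        degree[target] += 1
        degree.append(1)
    return edges


rng = random.Random(0)
print(len(rds_graph(100, 80, rng)), len(lpa_graph(100, 0.5, rng)))
```

A large value of a makes attachment nearly uniform, while a close to zero gives the strongest preference for high-degree vertices.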

2.3 Subgraph census 

We quantify the topology of a network by exhaustive subgraph census (37) up to a given subgraph
size; note that we do not assume a specific network randomization nor test for statistical
significance as in (19, 20, 21, 22, 23, 24, 25, 26, 27), but we classify network mechanisms using the
raw subgraph counts. Rather than choosing the most important features a priori, we count all
possible subgraphs up to a given cut-off, which can be made in the number of vertices, the number
of edges, or the length of a given walk. To show insensitivity to this choice, we present results
for two different cut-offs. We first count all subgraphs that can be constructed by a walk of
length eight (148 non-isomorphic^1 subgraphs); second, we consider all subgraphs up to a total
number of seven edges (130 non-isomorphic subgraphs). Their counts are the input features for
our classifier. It is worth noting that the mean geodesic length (average shortest path between
two vertices) of the Drosophila network's giant component is 11.6 (9.4) for p* = 0.65 (0.5).
Walks of length eight are therefore able to traverse large parts of the network and can also reveal
global structures.

^1 Two graphs are isomorphic if there exists a relabeling of their vertices such that the two graphs are identical.
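To make the notion of raw subgraph counts concrete, the sketch below counts four small subgraphs with closed-form expressions. It is a deliberately reduced stand-in for the full census of 148 (or 130) subgraphs; it assumes that occurrences are counted as not-necessarily-induced subgraphs and that the input is a set of distinct undirected edges without self-loops. The function name `small_subgraph_counts` is hypothetical.

```python
from math import comb


def small_subgraph_counts(edges):
    """Count a few small subgraphs in an undirected simple graph.

    edges: iterable of distinct (u, v) pairs, no self-loops, each edge once.
    """
    edges = list(edges)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}

    # Every 3-cycle is seen once from each of its three edges.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    # 2-edge paths: choose two neighbours of a centre vertex.
    wedges = sum(comb(k, 2) for k in deg.values())
    # 3-stars: choose three neighbours of a centre vertex.
    stars3 = sum(comb(k, 3) for k in deg.values())
    # 3-edge paths: extend each edge at both ends, then remove the cases
    # where both extensions meet in the same vertex (three per 3-cycle).
    paths3 = sum((deg[u] - 1) * (deg[v] - 1) for u, v in edges) - 3 * triangles
    return {"triangles": triangles, "wedges": wedges,
            "stars3": stars3, "paths3": paths3}


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(small_subgraph_counts(k4))
```

A dictionary of this kind, extended to all subgraphs below the chosen cut-off, is the feature vector that is handed to the classifier.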

2.4 Learning algorithm 

Our classifier is a generalized decision tree called an Alternating Decision Tree (ADT) (38),
which uses the AdaBoost (39) algorithm to learn the decision rules and associate weights with
them. AdaBoost is a general discriminative learning algorithm proposed in 1997 by Freund and
Schapire (39, 40), and has since been successfully used in numerous and varied applications
(e.g., in text categorization (41, 42) and gene expression prediction (43)). It is equivalent to an
additive logistic regression model (44).

An example of an ADT is shown in Figure 1. A given network's subgraph counts determine
paths in the tree dictated by inequalities specified by the decision nodes (rectangles). For each
class, the ADT outputs a real-valued prediction score, which is the sum of all weights over all
paths. The class with the highest score wins. The prediction score y(c) for class c is related to
the probability p(c) for the tested network to be in class c by p(c) = e^{2y(c)} / (1 + e^{2y(c)}).
(The supplemental material gives details on the exact learning algorithm.)
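The way an ADT turns subgraph counts into a score, and a score into a probability, can be sketched as follows. The nested-dictionary tree layout, the function names, and all numbers in `toy_tree` are hypothetical and serve only as an illustration; the sketch evaluates a single real-valued score, whereas the multi-class classifier produces one score per class. The probability is written as 1 / (1 + e^{-2y}), which is algebraically the same as e^{2y} / (1 + e^{2y}).

```python
from math import exp


def adt_score(node, counts):
    """Sum the prediction values over every path the input follows.

    A prediction node is {"value": float, "splits": [decision nodes]}.
    A decision node is {"feature", "threshold", "yes", "no"}, where "yes"
    is taken when counts[feature] < threshold. All decision nodes below a
    reached prediction node are evaluated, so several paths contribute.
    """
    total = node["value"]
    for split in node.get("splits", []):
        branch = "yes" if counts[split["feature"]] < split["threshold"] else "no"
        total += adt_score(split[branch], counts)
    return total


def score_to_probability(y):
    """p = e^{2y} / (1 + e^{2y}), evaluated in the equivalent logistic form."""
    return 1.0 / (1.0 + exp(-2.0 * y))


# Hypothetical tree with made-up thresholds and weights.
toy_tree = {
    "value": 0.5,
    "splits": [
        {"feature": "triangles", "threshold": 10,
         "yes": {"value": -1.0},
         "no": {"value": 2.0, "splits": [
             {"feature": "paths3", "threshold": 100,
              "yes": {"value": 0.25}, "no": {"value": -0.25}}]}},
        {"feature": "wedges", "threshold": 50,
         "yes": {"value": 0.0}, "no": {"value": 1.0}},
    ],
}
counts = {"triangles": 20, "paths3": 30, "wedges": 80}
y = adt_score(toy_tree, counts)
print(y, score_to_probability(y))
```

Only comparisons against thresholds enter the score, so rescaling a feature together with its thresholds leaves the prediction unchanged.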

An advantage of ADTs is that they do not assume a specific geometry of the input space; 
that is, features are not coordinates in a metric space (as in support vector machines or k- 
nearest-neighbors classifiers), and the classification is thus independent of normalization. The 
algorithm assumes neither independence nor dependence among subgraph counts. The features 
distinguish themselves solely by their individual abilities to discriminate different classes. 






3 Results 



We perform cross-validation with multi-class ADTs, thus determining an empirical
estimate of the generalization error, the probability of mislabeling an unseen test datum. The
confusion matrix in Table 1 shows truth and prediction for the test sets. Five out of seven classes
have nearly perfect prediction accuracy. Since AGV is constructed to be an interpolation between
LPA and a ring lattice, the AGV, LPA and SMW mechanisms are equivalent in specific
parameter regimes and correspondingly show a non-negligible overlap. Nevertheless, the overall
prediction accuracy on the test sets still lies between 94.6% and 95.8% for different choices
of p* and subgraph size cut-off. Note that preferential attachment is completely distinguishable
from duplication-mutation despite the fact that a duplication mechanism introduces an effective
preferential attachment. Even models that are based on the same fundamental
mechanism, like duplication-mutation in DMC and DMR, are perfectly separable. Even small
algorithmic changes in network mechanisms can thus give rise to easily detectable differences
in substructures. Figure 4 confirms that although many of these models have similar degree
distributions, clustering coefficients, or mean geodesic lengths, they do have distinguishable
topologies.
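The quantities reported from cross-validation can be computed as in the following sketch. It assumes the true and predicted class labels of the held-out graphs are available as two parallel lists; the function names and the toy labels are hypothetical, and the fold construction shown is one simple choice among several.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds.

    Shuffle the data beforehand if its order is not random.
    """
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, test


def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized matrix: entry [i][j] estimates P(predict j | truth i)."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = {}
    for t in classes:
        total = sum(counts[t].values())
        matrix[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return matrix


def accuracy(true_labels, predicted_labels):
    """Fraction of test graphs assigned to their true class."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Made-up labels for illustration only.
truth = ["DMC", "DMC", "LPA", "LPA", "AGV", "AGV", "AGV", "AGV"]
preds = ["DMC", "DMC", "LPA", "AGV", "AGV", "AGV", "LPA", "AGV"]
cm = confusion_matrix(truth, preds, ["DMC", "LPA", "AGV"])
print(cm["LPA"], accuracy(truth, preds))
```

Off-diagonal mass between two classes, as between LPA and AGV in the toy labels, is what an overlap of mechanisms in some parameter regime would look like in such a matrix.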

Figure 1 shows the first few decision nodes (out of 120) of a resulting ADT. The prediction
scores reveal that a high count of 3-cycles suggests a DMC network (node 3). The DMC mechanism
indeed facilitates the creation of many 3-cycles by allowing two copies to attach to each
other, thus creating 3-cycles with their common neighbors. In particular, a few combinations
are good predictors for some classes. For example, a low count of 3-cycles but a high count of
8-edge linear chains is a good predictor for LPA and DMR networks (nodes 3 and 4). Due to
the sparseness of the networks, preferential attachment does not lead to a clustered structure.
While LPA readily yields hubs, cycles are less probable. (More detailed ADTs can be viewed
in the supplemental material.)

Having built a classifier enjoying good prediction accuracy, we can now determine the network
mechanism that best reproduces the Drosophila protein network (or in principle any network
of the same size) using the trained ADTs for classification. Table 2 gives the prediction scores
of the Drosophila network for each of the seven classes, averaged over folds.

The duplication-mutation-complementation mechanism is the only class having a positive
prediction score in every case. In particular, for p* = 0.65 the DMC classification has high
scores of 8.2 and 8.6. Also, the comparatively small standard deviations over different folds
indicate robustness of the classification against data subsampling. While the high rankings 
of both duplication-mutation classes confirm our biological understanding of protein network 
evolution, our findings strongly support an evolution restricted by functional complementarity 
over an evolution that creates and deletes functions at random. 

Interestingly, for p* = 0.65 the RDG mechanism of random growth (edges are connected
randomly between existing vertices) has a higher prediction score than the LPA or AGV growing
graph mechanisms. Growth without any underlying mechanism other than chance therefore
generates networks closer in topology to the core network (p* = 0.65) of Drosophila than
growth governed by preferential attachment. We also emphasize that the small-world character
of high clustering and short mean geodesic length, often attributed to biological networks,
is not enough to conclude that the given network is close to the small-world model (4) (an
interpolation between regular ring lattices and randomly connected graphs), as shown here. The
classification for p* = 0.5 is less confident, probably due to the additional noise present in the
data when including interactions with a low confidence score p (improbable interactions), as we
discuss below.

While not necessary for the classification itself, visualizing subgraph profiles can give a
qualitative and more intuitive way of interpreting the classification result and a better
understanding of the topological differences between Drosophila and each of the seven mechanisms.
We plot in Figure 3 their color-coded subgraph counts, averaged over all 1000 realizations of
every model, for a representative subset of 50 subgraphs^2. We group together subgraphs (indicated
by black lines) for which the same model exhibits the smallest absolute difference between its
average subgraph count and the count for Drosophila. For 60% of the subgraphs (S1-S30), Drosophila's
counts are closest to DMC's. All of these subgraphs contain one or more cycles, including
highly connected subgraphs such as K4 (S1)^3, and long linear chains ending in cycles (S16,
S18, S22, S23, S25). DMC is the only mechanism that can give rise to the high occurrences of
cycles measured in Drosophila. Owing to the networks' sparseness, cyclic structure is unlikely
to be generated in LPA, AGV, SMW, and RDS. The models LPA and AGV, however, are close to
Drosophila's topology according to subgraphs S44-S50 featuring open-ended chains and hubs,
which occur frequently in both models as well as in Drosophila.
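The grouping criterion used for the subgraph profiles amounts to a nearest-mean assignment per subgraph, sketched below. The function name `closest_model` and all counts in the example are hypothetical; only the criterion (smallest absolute difference between a model's average count and the observed count) comes from the text.

```python
def closest_model(model_means, observed):
    """For each subgraph, return the model whose mean count is nearest.

    model_means: {model: {subgraph: average count over realizations}}
    observed:    {subgraph: count measured in the network of interest}
    """
    return {
        s: min(model_means, key=lambda m: abs(model_means[m][s] - observed[s]))
        for s in observed
    }


# Made-up numbers for illustration only.
model_means = {
    "DMC": {"S1": 40.0, "S44": 900.0},
    "LPA": {"S1": 2.0, "S44": 1500.0},
}
observed = {"S1": 35.0, "S44": 1400.0}
print(closest_model(model_means, observed))
```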

Since yeast two-hybrid data is known to be susceptible to numerous errors, proposed
inference methods are only reliable if they are robust against noise. To confirm that our method
shows this property, we classify the Drosophila network for various levels of artificially introduced
noise by replacing existing edges with random ones. Figure 5 shows the prediction scores for all
seven classes as functions of the fraction of edges replaced. As validation, the network is
correctly classified as an RDS graph when all edges are randomized. About 30% of Drosophila's
edges can be replaced without seeing any significant change in any of the seven prediction scores, and
about 45% can be replaced before Drosophila is no longer classified as a DMC network. At
this point the prediction scores of DMC, DMR and AGV are very close, which is also observed
for the prediction scores for p* = 0.5 (see Table 2), where they rank as the top three in this order. The
results therefore suggest that the less confident classification for p* = 0.5 could be mainly due
to the presence of more noise in the data after inclusion of edges with a low confidence score p.
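The noise experiment can be sketched as an edge-replacement routine. The details are assumptions: vertices are taken to be integers 0..n_vertices-1, replacement edges are drawn uniformly among vertex pairs, and a removed edge may by chance be drawn again. The function name `replace_edges` is hypothetical.

```python
import random


def replace_edges(edges, n_vertices, fraction, rng):
    """Replace a fraction of the edges with uniformly random ones.

    The number of edges is preserved. Requires the graph to be sparse
    enough that new distinct edges can always be found.
    """
    edges = {tuple(sorted(e)) for e in edges}
    n_replace = round(fraction * len(edges))
    # Sort before sampling so that results are reproducible for a given seed.
    removed = rng.sample(sorted(edges), n_replace)
    noisy = edges - set(removed)
    while len(noisy) < len(edges):
        u, v = rng.sample(range(n_vertices), 2)
        noisy.add((min(u, v), max(u, v)))
    return noisy


original = {(i, i + 1) for i in range(10)}  # a path on vertices 0..10
for fraction in (0.0, 0.3, 0.45, 1.0):
    noisy = replace_edges(original, 11, fraction, random.Random(0))
    print(fraction, len(noisy), len(noisy & original))
```

Reclassifying many independent noisy copies at each fraction and averaging the prediction scores gives curves of the kind described for Figure 5.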

We have presented a method to infer growth mechanisms for real networks. Advantageous
properties include robustness against both noise and data subsampling, and the absence of any
prior assumptions about which network features are important. Moreover, since the learning
algorithm does not assume any relationships among features, the input space can be augmented
with various features in addition to subgraph counts. We find that the Drosophila protein
interaction network is confidently classified as a DMC network, a result which strongly supports
ideas presented by Vazquez et al. and Force et al. about the nature of genetic evolution.
Recently, Wang et al. presented direct experimental evidence for a single DMC event in
Drosophila melanogaster. We anticipate that further use of machine learning techniques
will answer a number of questions of interest in systems biology.

^2 We refer to the supplemental material for the whole set of 148 subgraphs.
^3 A completely connected subgraph of four nodes.
Table 1: Confusion matrix for tested networks using five-fold cross-validation. Entries
show the probability of predicting class j given that the true class is i. The training data
is based on the size of the Drosophila protein network with a confidence threshold of p* = 0.5,
the input features of the classifier being counts of all possible walks of length eight. The overall
prediction accuracy is 95.8%. Prediction errors among AGV, LPA and SMW networks are due
to equivalence of the models in specific parameter regimes.
Table 2: Prediction scores for the Drosophila protein network for different confidence
thresholds p* and different cut-offs in subgraph size. Drosophila is consistently classified as a DMC
network, with an especially strong prediction for a confidence threshold of p* = 0.65 and
independently of the cut-off in subgraph size.
Figure 1: Alternating decision tree: The first few nodes of one of the trained ADTs are
shown. At every boosting iteration one new decision node (rectangle) with its two prediction
nodes (ovals) is introduced. Every test network follows several paths in the tree dictated by
inequalities in the decision nodes (S# refers to a specific subgraph count; see Figure 2). The
final score is the sum of all prediction scores over all paths and the class with the highest
prediction score wins.
Figure 2: Subgraphs associated with Figures 1 and 3. A representative subset of 50 subgraphs
out of 148 is shown.
Figure 3: Subgraph profiles. The average subgraph count of the training data for every
mechanism is shown for 50 representative subgraphs. The labels S1-S50 refer to Figure 2. Black
lines indicate that this model is closest to Drosophila based on the absolute difference between
the subgraph counts.
Figure 4: Discriminating similar networks: Ten graphs of two different mechanisms exhibit
similar average geodesic lengths and almost identical degree distributions and clustering
coefficients. (a) Cumulative degree distribution p(k > k0), average clustering coefficient (C) and
average geodesic length (L), all quantities averaged over a set of ten graphs. (b) Prediction
scores for all ten graphs and all five cross-validated ADTs. The two sets of graphs can be
perfectly separated by our classifier.
Figure 5: Robustness against noise: Edges in Drosophila are randomly replaced and the
network is reclassified. Plotted are prediction scores for each of the seven classes as more and
more edges are replaced. Every point is an average over 200 independent random replacements.
For high noise levels (beyond 80%) the network is classified as an Erdős-Rényi (RDS) graph.
Also note that the confidence in the classification as a DMC network for low noise (less than
30%) is even higher than in the classification as an RDS network for high noise. The prediction
score y(c) for class c is related to the estimated probability p(c) for the tested network to be in
class c by p(c) = e^{2y(c)} / (1 + e^{2y(c)}) (44).






